The data and modelling objects created in this notebook can be downloaded directly to save computational time.
Users who wish to complete the exercises can download a small
template R script. Assuming you have already downloaded the
data objects above, this script will load all data objects so that the
steps used to create them are not necessary to tackle the exercises.
This tutorial relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:
library(dplyr)
library(mvgam)
library(gratia)
library(ggplot2)
library(marginaleffects)
library(zoo)
library(viridis)
We will work with time series of rodent captures from the Portal Project, a long-term monitoring study based near the town of Portal, Arizona. Researchers have been operating a standardized set of baited traps within 24 experimental plots at this site since the 1970s. Sampling follows the lunar monthly cycle, with observations occurring on average about 28 days apart. However, missing observations do occur due to difficulties accessing the site (weather events, COVID disruptions, etc.). You can read about the full sampling protocol in this preprint by Ernest et al on bioRxiv.
Portal Project sampling scheme in the desert near Portal, Arizona, USA; photo by SKM Ernest
All data from the Portal Project are made openly available in near real-time so that they can provide the maximum benefit to scientific research and outreach (a set of open-source software tools makes these data readily accessible). These data are extremely rich, containing monthly counts of rodent captures for >20 species. But rather than accessing the raw data, we will use some data that I have already processed and put into a simple, usable form:
data("portal_data")
As the data come pre-loaded with the mvgam package, you
can read a little about it in the help page using
?portal_data. Before working with data, it is important to
inspect how the data are structured. There are various ways to inspect
data in R; I typically find the head function
to be convenient
head(portal_data)
## moon DM DO PP OT year month mintemp precipitation ndvi
## 1 329 10 6 0 2 2004 1 -9.710 37.8 1.465889
## 2 330 14 8 1 0 2004 2 -5.924 8.7 1.558507
## 3 331 9 1 2 1 2004 3 -0.220 43.5 1.337817
## 4 332 NA NA NA NA 2004 4 1.931 23.9 1.658913
## 5 333 15 8 10 1 2004 5 6.568 0.9 1.853656
## 6 334 NA NA NA NA 2004 6 11.590 1.4 1.761330
But the glimpse function in dplyr is also
useful for understanding how variables are structured
dplyr::glimpse(portal_data)
## Rows: 199
## Columns: 10
## $ moon <int> 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 3…
## $ DM <int> 10, 14, 9, NA, 15, NA, NA, 9, 5, 8, NA, 14, 7, NA, NA, 9…
## $ DO <int> 6, 8, 1, NA, 8, NA, NA, 3, 3, 4, NA, 3, 8, NA, NA, 3, NA…
## $ PP <int> 0, 1, 2, NA, 10, NA, NA, 16, 18, 12, NA, 3, 2, NA, NA, 1…
## $ OT <int> 2, 0, 1, NA, 1, NA, NA, 1, 0, 0, NA, 2, 1, NA, NA, 1, NA…
## $ year <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 20…
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,…
## $ mintemp <dbl> -9.710, -5.924, -0.220, 1.931, 6.568, 11.590, 14.370, 16…
## $ precipitation <dbl> 37.8, 8.7, 43.5, 23.9, 0.9, 1.4, 20.3, 91.0, 60.5, 25.2,…
## $ ndvi <dbl> 1.4658889, 1.5585069, 1.3378172, 1.6589129, 1.8536561, 1…
We will focus analyses on the time series of captures for one specific rodent species, the Desert Pocket Mouse Chaetodipus penicillatus. This species is interesting in that it goes into a kind of “hibernation” during the colder months, leading to very low captures during the winter period
Our study species, the Desert Pocket Mouse (Chaetodipus penicillatus); photo by SKM Ernest
Manipulating the data into a ‘long’ format is necessary for modelling
in mvgam, mgcv and brms. By
‘long’ format, we mean that each series x time observation
needs to have its own entry in the dataframe or
list object that we wish to use as data for modelling. A
simple example can be viewed by simulating data using the
sim_mvgam function. See ?sim_mvgam for more
details
data <- sim_mvgam(n_series = 4, T = 24)
head(data$data_train, 12)
## y season year series time
## 1 0 1 1 series_1 1
## 2 0 1 1 series_2 1
## 3 1 1 1 series_3 1
## 4 1 1 1 series_4 1
## 5 2 2 1 series_1 2
## 6 2 2 1 series_2 2
## 7 1 2 1 series_3 2
## 8 2 2 1 series_4 2
## 9 0 3 1 series_1 3
## 10 0 3 1 series_2 3
## 11 1 3 1 series_3 3
## 12 0 3 1 series_4 3
Notice how we have four different time series in these simulated
data, but we do not spread these out into different columns. Rather,
there is only a single column for the outcome variable, labelled
y in these simulated data. We also must supply a variable
labelled time to ensure the modelling software knows how to
arrange the time series when building models. This setup still allows us
to formulate multivariate time series models, as we will see in later
tutorials. Below are the steps needed to shape our
portal_data object into the correct form. First, we create
a time variable, select the column representing counts of
our target species (PP), and select appropriate variables
that we can use as predictors
portal_data %>%
# mvgam requires a 'time' variable be present in the data to index
# the temporal observations. This is especially important when tracking
# multiple time series. In the Portal data, the 'moon' variable indexes the
# lunar monthly timestep of the trapping sessions
dplyr::mutate(time = moon - (min(moon)) + 1) %>%
# We can also provide a more informative name for the outcome variable, which
# is counts of the 'PP' species (Chaetodipus penicillatus) across all control
# plots
dplyr::mutate(count = PP) %>%
# The other requirement for mvgam is a 'series' variable, which needs to be a
# factor variable to index which time series each row in the data belongs to.
# Again, this is more useful when you have multiple time series in the data
dplyr::mutate(series = as.factor('PP')) %>%
# Select the variables of interest to keep in the model_data
dplyr::select(series, year, time, count, mintemp, ndvi) -> model_data
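Before moving on, it can be worth confirming that the new indexing variables behave as mvgam expects; a quick sketch using base R checks (assuming the pipeline above has been run):

```r
# Quick sanity checks on the indexing variables created above:
# 'series' must be a factor and 'time' should increase through the data
stopifnot(
  is.factor(model_data$series),
  all(diff(model_data$time) > 0)
)
```

If either condition fails, stopifnot() will throw an informative error before any modelling begins.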
The data now contain six variables:
series, a factor indexing which time series each
observation belongs to
year, the year of sampling
time, the indicator of which time step each observation
belongs to
count, the response variable representing the number of
captures of the species PP in each sampling
observation
mintemp, the monthly average minimum temperature at each
time step
ndvi, the monthly average Normalized Difference Vegetation
Index at each time step
Now check the data structure again
head(model_data)
## series year time count mintemp ndvi
## 1 PP 2004 1 0 -9.710 1.465889
## 2 PP 2004 2 1 -5.924 1.558507
## 3 PP 2004 3 2 -0.220 1.337817
## 4 PP 2004 4 NA 1.931 1.658913
## 5 PP 2004 5 10 6.568 1.853656
## 6 PP 2004 6 NA 11.590 1.761330
dplyr::glimpse(model_data)
## Rows: 199
## Columns: 6
## $ series <fct> PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP…
## $ year <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 20…
## $ time <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ count <int> 0, 1, 2, NA, 10, NA, NA, 16, 18, 12, NA, 3, 2, NA, NA, 13, NA,…
## $ mintemp <dbl> -9.710, -5.924, -0.220, 1.931, 6.568, 11.590, 14.370, 16.520, …
## $ ndvi <dbl> 1.4658889, 1.5585069, 1.3378172, 1.6589129, 1.8536561, 1.76132…
You can also summarize multiple variables at once, which is helpful for inspecting data ranges and identifying missing values
summary(model_data)
## series year time count mintemp
## PP:199 Min. :2004 Min. : 1.0 Min. : 0.00 Min. :-24.000
## 1st Qu.:2008 1st Qu.: 50.5 1st Qu.: 2.50 1st Qu.: -3.884
## Median :2012 Median :100.0 Median :12.00 Median : 2.130
## Mean :2012 Mean :100.0 Mean :15.14 Mean : 3.504
## 3rd Qu.:2016 3rd Qu.:149.5 3rd Qu.:24.00 3rd Qu.: 12.310
## Max. :2020 Max. :199.0 Max. :65.00 Max. : 18.140
## NA's :36
## ndvi
## Min. :0.2817
## 1st Qu.:1.0741
## Median :1.3501
## Mean :1.4709
## 3rd Qu.:1.8178
## Max. :3.9126
##
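The NA's entry under count in this summary can also be pulled out programmatically; a one-liner sketch in base R:

```r
# Tally the missing response observations
# (the summary above reports 36 NA's for count)
sum(is.na(model_data$count))
```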
We have some NAs in our response variable
count. Let’s visualize the data as a heatmap to get a sense
of where these are distributed (NAs are shown as red bars
in the below plot)
image(is.na(t(model_data %>%
dplyr::arrange(dplyr::desc(time)))), axes = F,
col = c('grey80', 'darkred'))
axis(3, at = seq(0,1, len = NCOL(model_data)), labels = colnames(model_data))
These observations will generally be thrown out by most modelling
packages in R. But as you will see when we work through the
tutorials, mvgam keeps these in the data so that
predictions can be automatically returned for the full dataset.
Exercises
1. Plot the count, mintemp and ndvi variables as histograms to get a better sense of the types of values they can take. See ?hist for guidance.
2. Calculate what proportion of observations in count are zeros. See ?which and ?== for guidance.
# the which function is useful for finding indices in a variable that fit a certain pattern
which(model_data$count == 0)
# to see how many zeros there are, calculate the length of this set of indices
length(which(model_data$count == 0))
# to calculate proportion of observations that are zero, divide this by the total number of observations
length(which(model_data$count == 0)) / length(model_data$count)
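As an aside, because R coerces logical values to 0/1, the same proportion can be computed more compactly with mean(); a small sketch on a toy vector (the values here are made up purely for illustration):

```r
# Toy vector with three zeros, two positive counts and one NA
x <- c(0, 3, 0, NA, 5, 0)

# mean() of a logical vector is the proportion of TRUEs;
# na.rm = TRUE drops the missing value before averaging
mean(x == 0, na.rm = TRUE)  # 3 zeros out of 5 non-missing values = 0.6
```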
“The first thing to do in any data analysis task is to plot the data. Graphs enable many features of the data to be visualised, including patterns, unusual observations, changes over time, and relationships between variables. The features that are seen in plots of the data must then be incorporated, as much as possible, into the forecasting methods to be used” (Hyndman and Athanasopoulos; Forecasting: Principles and Practice (3rd ed))
A time series is a set of observations taken sequentially in time. When working with time series, the first step (after inspecting the data structure) is usually to plot the observations over time, as either a line plot or scatterplot
plot(x = model_data$time, y = model_data$count, type = 'l',
xlab = 'Time', ylab = 'Counts', bty = 'l', col = 'darkred',
lwd = 2)
This plot reveals several important features including: missing data, evidence of repeated seasonal cycles, lower bounding at zero and long-term undulations. Most of these features make it very challenging to model series like this using conventional time series / forecasting software packages. But before we discuss modelling approaches, let’s build some more visuals to look for other key features of the data
Lag plots are often used to inspect temporal autocorrelations at a
variety of lags in the observations for a given univariate time series.
While correlation measures the extent of a linear relationship between
two variables, autocorrelations are used to measure linear relationships
between lagged values of a time series. Ordinarily we’d want to plot our
outcome of interest (count), but the default
lag.plot can’t handle missing values in the time series
(more on that below). But we can inspect correlations among observed
NDVI values (using the ndvi variable) and lagged versions
of ndvi to see evidence of strong positive autocorrelation
at several lags
lag.plot(model_data$ndvi, lags = 4, pch = 16)
This function works by plotting the lagged value of the variable
ndvi (i.e. the value of the
series at time \(t-lag\)) vs the
observed value at time \(t\). We can
replicate this process for a single lag so it is more obvious what is
happening under the hood
ndvi_lag1 <- model_data$ndvi[1:(NROW(model_data) - 1)]
plot(ndvi_lag1, model_data$ndvi[2:NROW(model_data)],
     xlab = expression("NDVI"["t-1"]),
     ylab = expression("NDVI"["t"]), pch = 16, bty = 'l')
A Pearson's correlation test confirms that there is a strong positive
relationship between current ndvi and what
ndvi was one month earlier
cor.test(ndvi_lag1, model_data$ndvi[2:NROW(model_data)])
But unfortunately the lag plot will fail when the variable of
interest has missing values, which our outcome of interest
(count) has plenty of. Rather than using a lag plot, we can
instead plot the autocorrelation function at a variety of lags. These
plots are often used for deciding which moving average order
to choose when building autoregressive models
acf(model_data$count,
na.action = na.pass,
bty = 'l',
lwd = 2,
ci.col = 'darkred',
main = 'Counts of PP trappings')
Several features are evident from this plot: First, we observe positive autocorrelation at small lags. When time series show a trend, autocorrelations for small lags will often be large and positive because observations nearby in time have similar values. So the ACF suggests there is evidence of a trend, as autocorrelations tend to decrease slowly as the lags increase. Second, we see evidence of seasonality in captures. When data are seasonal, the autocorrelations will often be larger for the seasonal lags (at multiples of the seasonal period) than for other lags. We know this species shows evidence of torpor in the winter months, so it is not surprising to see a strong annual pattern in these autocorrelations.
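Both signatures described above (slow decay from a trend, peaks at seasonal lags) are easy to reproduce on simulated data; a minimal sketch using base R, where all parameter values are arbitrary:

```r
# Simulate 120 observations with an upward trend and a 12-step seasonal cycle
set.seed(42)
t_index <- 1:120
sim <- 0.05 * t_index +              # slow upward trend
  2 * sin(2 * pi * t_index / 12) +   # seasonality with period 12
  rnorm(120, sd = 0.5)               # observation noise

# The ACF shows slow decay plus clear peaks at lags 12, 24 and 36
acf(sim, lag.max = 36, main = 'ACF of a simulated trend + seasonal series')
```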
The partial autocorrelation function shows something similar, but has controlled for the short-term lagged correlations. These plots are often used for selecting orders of autoregressive models. For example, if we choose an AR2 model then we hope our AR2 parameter will capture remaining autocorrelations at lag 2 when controlling for any autocorrelations at lag 1.
pacf(model_data$count,
na.action = na.pass,
bty = 'l',
lwd = 2,
ci.col = 'darkred',
main = 'Counts of PP trappings')
Both of these plots suggest there is temporal autocorrelation
present, as well as strong seasonality. We can view the ACF and some
other useful features of the data with one of mvgam’s
primary functions, plot_mvgam_series (see
?plot_mvgam_series for details):
plot_mvgam_series(data = model_data, y = 'count')
Another standard visualisation is to compute and plot a Seasonal
Trend Decomposition. As we have seen from our example plots so far, time
series data can exhibit a variety of patterns including trend and
seasonality. It is often helpful to split a time series into several
components, each representing an underlying pattern, to inspect features
of the data that may not be immediately apparent when simply viewing the
series. There are many variations of these decompositions, but the most
popular is probably the STL decomposition, which decomposes a time
series into seasonal, trend and irregular components using
loess smoothing (see ?stl for details). But
just to get to a point where we can use this with most ecological
series, we have to somehow deal with missing observations. The usual
practice is to impute these in some way. We can use the
na.approx function from the zoo package for
this, which will use linear interpolation to impute missing values. This
requires that we first convert the count observations to a
time series object of class ts. We set the frequency to
12 as this is close enough to what we have with our lunar
monthly observation intervals. We will not be using the imputed data for
any modelling, so please don’t worry too much if you aren’t up to speed
on how linear interpolation works. This is purely for visualization
purposes
# First convert the counts to a time series object of class ts
count_ts <- ts(model_data$count, frequency = 12)
# Impute NAs using linear interpolation
interp_series <- zoo::na.approx(count_ts)
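If you have not used na.approx before, its behaviour is easiest to see on a toy vector: interior missing values are replaced by points on the straight line joining their non-missing neighbours

```r
library(zoo)

# Interior NAs are filled by linear interpolation between neighbours;
# here the straight line from 2 (position 1) to 8 (position 4)
# fills in the values 4 and 6
zoo::na.approx(c(2, NA, NA, 8))
```

Note that leading or trailing NAs have no neighbour on one side, so na.approx drops them by default rather than extrapolating.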
Now we can perform the STL decomposition. This requires us to specify the seasonal window, which controls the span in lags over which the seasonal loess smooth will operate. Smaller values allow for more rapid changes by setting a smaller number of consecutive years to estimate each value in the seasonal component. A key advantage of this method is that the seasonal component is allowed to change over time. But a drawback is that this parameter must be chosen by the analyst, and changing its value can give different results. We will use a 5-year seasonal window in this first example
plot(stl(interp_series, s.window = 5))
The strong seasonality is obvious in this plot, as is the evidence of
an undulating trend. There is also a lot of variation in the
remainder component, which demonstrates that the
seasonality and trend alone cannot capture all of the variation in the
observations over time. The sizes of the bars on the right-hand side of
these plots show scale to help us understand the relative magnitudes of
each component. But what happens if we change the seasonal window and
force it to be stable through time?
plot(stl(interp_series, s.window = 'periodic'))
Here the remainder component is even more important, and
the trend has become less smooth. Try varying the s.window
argument to see how the conclusions about component contributions will
change. For more information on decompositions, see this chapter in Applied Time Series Analysis for Fisheries and
Environmental Sciences, by Holmes, Scheuerell and Ward
Exercises
1. Calculate and plot a cross-correlation function between count and ndvi at lags of 1 - 15 months. This procedure is often used to help identify lags of the x-variable that might be useful predictors of the outcome. Note, you will need to use na.action = na.pass so that the missing values in count don't cause problems. See ?ccf for details about calculating cross-correlations using type = 'correlation'. Which lags of ndvi would you consider to be the most likely candidates for predicting our outcome count?
2. Produce an STL decomposition for the ndvi variable.
# fill in the "?" with the correct variable / value
ccf(x = ?,
y = ?,
na.action = ?,
lag.max = ?,
# compute correlations at each lag
type = 'correlation',
# not necessary but makes for a prettier plot
bty = 'l',
lwd = 2,
ci.col = 'darkred',
main = expression(paste(italic(Cor),"(",ndvi[lag],",",count, ")")))
Now onto some modelling! Our first task will be to fit a Generalized
Linear Model (GLM) that can adequately capture the features of our
count observations (integer data, lower bound at zero,
missing values) while also attempting to model temporal variation. We
are almost ready to fit our first model, which will be a GLM with
Poisson observations, a log link function and random (hierarchical)
intercepts for year. This will allow us to capture our
prior belief that, although each year is unique, all years have been
sampled from the same population of effects and are therefore connected,
meaning each year may contain valuable information about the others.
This will be done
by capitalizing on the partial pooling properties of hierarchical
models:
General structure of hierarchical models; credit to Bayes Rules!, by Johnson, Ott, and Dogucu (2021); reproduced under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Hierarchical (also known as random) effects offer many advantages
when modelling data with grouping structures (i.e. multiple species,
locations, years etc…). The ability to incorporate these in time series
models is a huge advantage over traditional models such as ARIMA or
Exponential Smoothing. But before we fit the model, we will need to
convert year to a factor so that we can use a random effect
basis in mvgam. See ?smooth.terms and
?smooth.construct.re.smooth.spec for details about the
re basis construction that is used by both
mvgam and mgcv
model_data %>%
# Convert 'year' to a factor variable
dplyr::mutate(year = factor(year)) -> model_data
Preview the dataset to ensure year is now a factor with a unique factor level for each year in the data
dplyr::glimpse(model_data)
## Rows: 199
## Columns: 6
## $ series <fct> PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP…
## $ year <fct> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 20…
## $ time <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ count <int> 0, 1, 2, NA, 10, NA, NA, 16, 18, 12, NA, 3, 2, NA, NA, 13, NA,…
## $ mintemp <dbl> -9.710, -5.924, -0.220, 1.931, 6.568, 11.590, 14.370, 16.520, …
## $ ndvi <dbl> 1.4658889, 1.5585069, 1.3378172, 1.6589129, 1.8536561, 1.76132…
levels(model_data$year)
## [1] "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"
## [11] "2014" "2015" "2016" "2017" "2018" "2019" "2020"
We are now ready for our first mvgam model. The syntax
will be familiar to users who have previously built models with
mgcv. But for a refresher, see ?formula.gam
and the examples in ?gam. Random effects can be specified
using the s wrapper with the re basis. Note
that we can also suppress the primary intercept using the usual
R formula syntax - 1. mvgam has a
number of possible observation families that can be used, see
?mvgam_families for more information. We will use
Stan as the fitting engine, which deploys Hamiltonian Monte
Carlo (HMC) for full Bayesian inference. By default, 4 HMC chains will
be run using a warmup of 500 iterations and collecting 500 posterior
samples from each chain. Interested users should consult the Stan user’s guide for more information
about the software and the enormous variety of models that can be
tackled with HMC.
model1 <- mvgam(count ~ s(year, bs = 're') - 1,
family = poisson(),
data = model_data)
Once the model has finished, the first step is to inspect the
summary to ensure no major diagnostic warnings have been
produced and to quickly summarise posterior distributions for key
parameters
summary(model1)
## GAM formula:
## count ~ s(year, bs = "re") - 1
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 199
##
## Status:
## Fitted using Stan
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## s(year).1 1.79166375 2.053640 2.298871 1.00 2319
## s(year).2 2.52290975 2.686530 2.841528 1.00 3344
## s(year).3 2.95633975 3.090700 3.218242 1.00 2679
## s(year).4 3.13489000 3.255710 3.376575 1.00 2294
## s(year).5 1.95044550 2.127760 2.313652 1.00 3170
## s(year).6 1.50514700 1.751240 1.972611 1.00 2832
## s(year).7 1.79093825 2.044105 2.288119 1.00 2340
## s(year).8 2.82005750 2.961230 3.106683 1.00 3429
## s(year).9 3.13490300 3.248450 3.361222 1.00 2504
## s(year).10 2.58106650 2.761645 2.925403 1.00 2857
## s(year).11 2.95924000 3.077680 3.198266 1.00 3413
## s(year).12 3.08999650 3.212885 3.333532 1.00 3178
## s(year).13 2.04297500 2.245830 2.440114 1.00 2635
## s(year).14 2.47765225 2.642915 2.803351 1.00 2612
## s(year).15 1.92716325 2.167520 2.377893 1.00 2361
## s(year).16 1.86200000 2.074965 2.275081 1.00 2774
## s(year).17 -0.05809112 1.112125 1.952538 1.01 404
##
## GAM group-level estimates:
## 2.5% 50% 97.5% Rhat n.eff
## mean(year) 2.0777965 2.4257500 2.734998 1.01 210
## sd(year) 0.4440271 0.6514505 1.039881 1.02 209
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
The diagnostic messages at the bottom of the summary show that the
HMC sampler did not encounter any problems or difficult posterior
spaces. This is a good sign. Posterior distributions for the
coefficients related to the GAM linear predictor (i.e. the \(\beta\)’s) can be extracted as a
data.frame or matrix as follows:
beta_post <- as.data.frame(model1, variable = 'betas')
dplyr::glimpse(beta_post)
## Rows: 2,000
## Columns: 17
## $ `s(year).1` <dbl> 2.19222, 2.15275, 2.13996, 2.19192, 1.89205, 2.02595, 2.0…
## $ `s(year).2` <dbl> 2.60606, 2.71602, 2.70798, 2.64010, 2.67649, 2.62063, 2.8…
## $ `s(year).3` <dbl> 3.07285, 3.09481, 3.04453, 3.08872, 3.09278, 2.95964, 2.9…
## $ `s(year).4` <dbl> 3.30948, 3.27944, 3.30520, 3.14628, 3.34322, 3.33284, 3.3…
## $ `s(year).5` <dbl> 2.04624, 2.06391, 2.07594, 2.23780, 2.06812, 2.12038, 2.1…
## $ `s(year).6` <dbl> 1.70115, 1.61986, 1.63557, 1.74989, 1.79217, 1.90743, 1.6…
## $ `s(year).7` <dbl> 2.14936, 2.09258, 2.10297, 2.09118, 2.04359, 2.14322, 1.9…
## $ `s(year).8` <dbl> 2.87866, 3.04622, 3.09253, 2.76098, 2.93532, 2.93814, 2.9…
## $ `s(year).9` <dbl> 3.16014, 3.15865, 3.08779, 3.32928, 3.15021, 3.16348, 3.1…
## $ `s(year).10` <dbl> 2.66322, 2.71084, 2.65445, 2.83438, 2.76970, 2.87555, 2.6…
## $ `s(year).11` <dbl> 3.11679, 3.15587, 3.10185, 2.99862, 3.09361, 3.13482, 3.1…
## $ `s(year).12` <dbl> 3.21742, 3.18601, 3.15495, 3.19846, 3.18066, 3.16771, 3.1…
## $ `s(year).13` <dbl> 2.28211, 2.21183, 2.20409, 2.05887, 2.21813, 2.25994, 2.1…
## $ `s(year).14` <dbl> 2.70576, 2.71362, 2.69127, 2.54511, 2.70080, 2.50278, 2.7…
## $ `s(year).15` <dbl> 2.20730, 2.15673, 2.12767, 2.05567, 1.95951, 2.10820, 2.2…
## $ `s(year).16` <dbl> 1.99186, 1.91282, 1.83756, 2.30329, 2.06263, 2.06949, 2.0…
## $ `s(year).17` <dbl> 1.2771100, 0.9090630, 0.8404070, 1.4549500, 0.9490400, 1.…
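With the draws in an ordinary data.frame, any posterior summary is a one-liner away; for example, a sketch that reproduces the style of interval reported in the model summary (column names taken from the glimpse above):

```r
# Posterior means and 95% credible intervals for each yearly intercept,
# computed directly from the extracted draws
beta_summary <- data.frame(
  mean  = colMeans(beta_post),
  lower = apply(beta_post, 2, quantile, probs = 0.025),
  upper = apply(beta_post, 2, quantile, probs = 0.975)
)
head(beta_summary)
```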
With any model fitted in mvgam, the underlying
Stan code can be viewed using the code
function:
code(model1)
Stan code
## // Stan model code generated by package mvgam
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## int<lower=0> ytimes[n, n_series]; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## int<lower=0> n_nonmissing; // number of nonmissing observations
## int<lower=0> flat_ys[n_nonmissing]; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## int<lower=0> obs_ind[n_nonmissing]; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // random effect variances
## vector<lower=0>[1] sigma_raw;
## // random effect means
## vector[1] mu_raw;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## b[1:17] = mu_raw[1] + b_raw[1:17] * sigma_raw[1];
## }
## model {
## // prior for random effect population variances
## sigma_raw ~ exponential(0.5);
## // prior for random effect population means
## mu_raw ~ std_normal();
## // prior (non-centred) for s(year)...
## b_raw[1:17] ~ std_normal();
## {
## // likelihood functions
## flat_ys ~ poisson_log_glm(flat_xs,
## 0.0,b);
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## array[n, n_series] int ypred;
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){
## mus[1:n, s] = eta[ytimes[1:n, s]];
## ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
## }
## }
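The transformed parameters block above is the non-centred parameterisation of the random effects: rather than sampling each yearly intercept directly, Stan samples standard-normal deviates (b_raw) and then scales and shifts them by the population parameters. The same computation can be sketched in R, using made-up parameter values for illustration:

```r
# Non-centred random effects: b = mu + z * sigma, with z ~ Normal(0, 1)
set.seed(1)
mu_raw    <- 2.4    # population mean (illustrative value)
sigma_raw <- 0.65   # population standard deviation (illustrative value)
b_raw     <- rnorm(17)                   # one standardized deviate per year
b         <- mu_raw + b_raw * sigma_raw  # yearly intercepts, log scale
round(range(b), 2)
```

This reparameterisation often makes HMC sampling of hierarchical models more efficient, because the sampler explores the standardized deviates rather than parameters whose scale depends on sigma_raw.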
Now for interrogating the model. We can get some sense of the
variation in yearly intercepts from the summary above, but it is easier
to understand them using targeted plots. Plot posterior distributions of
the temporal random effects using plot.mvgam with
type = 're'. See ?plot.mvgam for more details
about the types of plots that can be produced from fitted
mvgam objects
plot(model1, type = 're')
How do these yearly intercepts translate into time-varying
predictions? To understand this, we can plot posterior hindcasts from
this model for the training period using plot.mvgam with
type = 'forecast'
plot(model1, type = 'forecast')
If you wish to extract these hindcasts for other downstream analyses,
the hindcast function can be used. This will return a list
object of class mvgam_forecast. In the
hindcasts slot, a matrix of posterior retrodictions will be
returned for each series in the data (only one series in our
example):
hc <- hindcast(model1)
str(hc)
## List of 13
## $ call :Class 'formula' language count ~ s(year, bs = "re") - 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## $ trend_call : NULL
## $ family : chr "poisson"
## $ trend_model : chr "None"
## $ drift : logi FALSE
## $ use_lv : logi FALSE
## $ fit_engine : chr "stan"
## $ type : chr "response"
## $ series_names : chr "PP"
## $ train_observations:List of 1
## ..$ PP: int [1:199] 0 1 2 NA 10 NA NA 16 18 12 ...
## $ test_observations : NULL
## $ hindcasts :List of 1
## ..$ PP: num [1:2000, 1:199] 7 11 4 6 9 7 6 8 6 11 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:199] "ypred[1,1]" "ypred[2,1]" "ypred[3,1]" "ypred[4,1]" ...
## $ forecasts : NULL
## - attr(*, "class")= chr "mvgam_forecast"
You can also extract these hindcasts on the linear predictor scale, which in this case is the log scale (our Poisson GLM used a log link function). Sometimes this can be useful for asking more targeted questions about drivers of variation:
hc <- hindcast(model1, type = 'link')
range(hc$hindcasts$PP)
## [1] -2.68149 3.46132
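Because the model used a log link, exponentiating the link-scale hindcasts maps them back onto the scale of expected counts; a quick sketch:

```r
# exp() is the inverse of the log link, so this returns the range of
# posterior expected counts rather than log-counts
range(exp(hc$hindcasts$PP))
```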
Objects of class mvgam_forecast have an associated plot
function as well:
plot(hc)
This plot can look a bit confusing as it seems like there is linear
interpolation from the end of one year to the start of the next. But
this is just due to the way the lines are automatically connected in
base R plots
In any regression analysis, a key question is whether the residuals
show any patterns that can be indicative of un-modelled sources of
variation. For GLMs, we can use a modified residual called the Dunn-Smyth,
or randomized quantile, residual. Inspect Dunn-Smyth residuals from
the model using plot.mvgam with
type = 'residuals'
plot(model1, type = 'residuals')
Exercise
Extract the model's posterior residuals, which mvgam stores as a matrix
of dimension n_draws x n_timepoints. Calculate posterior median
residuals per time point and plot these as a line plot (hint, use
?apply and either ?median or
?quantile for guidance). Add a dashed horizontal line at
0 (hint, use ?abline for guidance). Do these
residuals look worrisome to you?
# fill in the "?" with the correct variable / value
# extract residuals for our rodent species of interest
model1_resids <- model1$resids$PP
# check the dimensions of the residuals matrix
dim(model1_resids)
# calculate posterior median residuals per timepoint
median_resids <- apply(model1_resids,
MARGIN = ?,
FUN = ?)
# plot median residuals over time
plot(median_resids,
type = 'l',
# not necessary but makes for a prettier plot
lwd = 2,
col = 'darkred',
bty = 'l',
ylab = 'Median residual',
xlab = 'Time')
# add a horizontal dashed line at zero
abline(? = 0, lty = 'dashed')
These temporal random effects do not have a sense of “time”. Because
of this, each yearly random intercept is not restricted in some way to
be similar to the previous yearly intercept. This drawback becomes
evident when we predict for a new year. To do this, we can repeat the
exercise above but this time will split the data into training and
testing sets before re-running the model. We can then supply the test
set as newdata. For splitting, we will make use of the
filter function from dplyr
model_data %>%
dplyr::filter(time <= 160) -> data_train
model_data %>%
dplyr::filter(time > 160) -> data_test
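It is worth confirming that the split is strictly temporal, with no overlap between the two sets; a quick sketch:

```r
# The training set should end at time = 160 and the test set begin after it
max(data_train$time)
min(data_test$time)

# No time point should appear in both sets
any(data_train$time %in% data_test$time)  # should be FALSE
```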
model1b <- mvgam(count ~ s(year, bs = 're') - 1,
family = poisson(),
data = data_train,
newdata = data_test)
Repeating the plots above gives insight into how the model’s hierarchical prior formulation provides all the structure needed to sample values for un-modelled years
plot(model1b, type = 're')
plot(model1b, type = 'forecast')
We can also view the test data in the forecast plot to see that the predictions do not capture the temporal variation in the test set
plot(model1b, type = 'forecast', newdata = data_test)
## Out of sample DRPS:
## [1] 181.994
As with the hindcast function, we can use the
forecast function to automatically extract the posterior
distributions for these predictions. This also returns an object of
class mvgam_forecast, but now it will contain both the
hindcasts and forecasts for each series in the data:
fc <- forecast(model1b)
str(fc)
## List of 13
## $ call :Class 'formula' language count ~ s(year, bs = "re") - 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## $ trend_call : NULL
## $ family : chr "poisson"
## $ trend_model : chr "None"
## $ drift : logi FALSE
## $ use_lv : logi FALSE
## $ fit_engine : chr "stan"
## $ type : chr "response"
## $ series_names : chr "PP"
## $ train_observations:List of 1
## ..$ PP: int [1:160] 0 1 2 NA 10 NA NA 16 18 12 ...
## $ test_observations :List of 1
## ..$ PP: int [1:39] NA 0 0 10 3 14 18 NA 28 46 ...
## $ hindcasts :List of 1
## ..$ PP: num [1:2000, 1:160] 12 4 7 5 6 10 5 11 12 6 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:160] "ypred[1,1]" "ypred[2,1]" "ypred[3,1]" "ypred[4,1]" ...
## $ forecasts :List of 1
## ..$ PP: num [1:2000, 1:39] 6 8 8 7 16 10 7 10 10 8 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:39] "ypred[161,1]" "ypred[162,1]" "ypred[163,1]" "ypred[164,1]" ...
## - attr(*, "class")= chr "mvgam_forecast"
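Once these draw matrices are in hand, summarising them needs no special machinery: each is draws × horizon, so empirical quantiles per column give prediction intervals. A sketch using a simulated matrix as a stand-in for fc$forecasts$PP:

```r
set.seed(123)
# stand-in for fc$forecasts$PP: 2000 draws x 39 forecast horizons
fc_draws <- matrix(rpois(2000 * 39, lambda = 10), nrow = 2000)

# 95% empirical prediction intervals and medians, per horizon
fc_quantiles <- apply(fc_draws, MARGIN = 2, FUN = quantile,
                      probs = c(0.025, 0.5, 0.975))

# 3 x 39: one row per quantile, one column per forecast horizon
dim(fc_quantiles)
```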
Any users familiar with GLMs will know that we nearly always wish to
include predictor variables that may explain some of the variation in
our observations. This is also true when working with time series,
though the inclusion of predictors immediately rules out some of the
more popular forecasting algorithms (such as Exponential Smoothing). Predictors are easily
incorporated into GLMs / GAMs. Here, we will update the model from above
by including a parametric (fixed) effect of ndvi as a
linear predictor:
model2 <- mvgam(count ~ s(year, bs = 're') +
ndvi - 1,
family = poisson(),
data = data_train,
newdata = data_test)
summary(model2)
## GAM formula:
## count ~ s(year, bs = "re") + ndvi - 1
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 160
##
## Status:
## Fitted using Stan
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## ndvi 0.3326766 0.391681 0.4511045 1 1589
## s(year).1 1.1444945 1.397585 1.6605210 1 2093
## s(year).2 1.8086288 1.996410 2.1935925 1 2531
## s(year).3 2.1939975 2.371445 2.5550982 1 2047
## s(year).4 2.3301775 2.499675 2.6685193 1 1692
## s(year).5 1.2004423 1.418300 1.6364400 1 2288
## s(year).6 1.0208870 1.274220 1.5048660 1 2817
## s(year).7 1.1164623 1.409210 1.6845937 1 2791
## s(year).8 2.0945170 2.268220 2.4394002 1 2056
## s(year).9 2.7248317 2.851745 2.9862560 1 2388
## s(year).10 1.9700297 2.177440 2.3734320 1 2452
## s(year).11 2.2804748 2.435230 2.5914695 1 1903
## s(year).12 2.5444853 2.688195 2.8287460 1 2131
## s(year).13 1.3758935 1.607800 1.8428565 1 2437
## s(year).14 0.7292172 1.961535 3.2808082 1 1745
## s(year).15 0.6676110 1.980015 3.2593057 1 1401
## s(year).16 0.5908926 1.991000 3.3315577 1 1552
## s(year).17 0.5507315 1.979835 3.2677300 1 1584
##
## GAM group-level estimates:
## 2.5% 50% 97.5% Rhat n.eff
## mean(year) 1.5765945 1.9693400 2.3245617 1.01 371
## sd(year) 0.4038167 0.5816975 0.9583115 1.01 356
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
Rather than printing the summary each time, we can also quickly look
at the posterior empirical quantiles for the fixed effect of
ndvi (and other linear predictor coefficients) using
coef:
coef(model2)
## 2.5% 50% 97.5% Rhat n.eff
## ndvi 0.3326766 0.391681 0.4511045 1 1589
## s(year).1 1.1444945 1.397585 1.6605210 1 2093
## s(year).2 1.8086288 1.996410 2.1935925 1 2531
## s(year).3 2.1939975 2.371445 2.5550982 1 2047
## s(year).4 2.3301775 2.499675 2.6685193 1 1692
## s(year).5 1.2004423 1.418300 1.6364400 1 2288
## s(year).6 1.0208870 1.274220 1.5048660 1 2817
## s(year).7 1.1164623 1.409210 1.6845937 1 2791
## s(year).8 2.0945170 2.268220 2.4394002 1 2056
## s(year).9 2.7248317 2.851745 2.9862560 1 2388
## s(year).10 1.9700297 2.177440 2.3734320 1 2452
## s(year).11 2.2804748 2.435230 2.5914695 1 1903
## s(year).12 2.5444853 2.688195 2.8287460 1 2131
## s(year).13 1.3758935 1.607800 1.8428565 1 2437
## s(year).14 0.7292172 1.961535 3.2808082 1 1745
## s(year).15 0.6676110 1.980015 3.2593057 1 1401
## s(year).16 0.5908926 1.991000 3.3315577 1 1552
## s(year).17 0.5507315 1.979835 3.2677300 1 1584
Look at the estimated effect of ndvi using
plot.mvgam with type = 'pterms'
plot(model2, type = 'pterms')
This plot indicates a positive linear effect of ndvi on
log(counts). But it may be easier to visualise using a
histogram, especially for parametric (linear) effects. This can be done
by first extracting the posterior coefficients as we did in the first
example:
beta_post <- as.data.frame(model2, variable = 'betas')
dplyr::glimpse(beta_post)
## Rows: 2,000
## Columns: 18
## $ ndvi <dbl> 0.361561, 0.346356, 0.364961, 0.362273, 0.351501, 0.37742…
## $ `s(year).1` <dbl> 1.44308, 1.42792, 1.36674, 1.61025, 1.51704, 1.22912, 1.5…
## $ `s(year).2` <dbl> 1.89607, 2.02703, 1.96253, 2.09319, 2.05888, 2.07440, 1.9…
## $ `s(year).3` <dbl> 2.51088, 2.43188, 2.31213, 2.40475, 2.45401, 2.27109, 2.4…
## $ `s(year).4` <dbl> 2.58948, 2.61004, 2.51078, 2.56168, 2.64230, 2.63496, 2.4…
## $ `s(year).5` <dbl> 1.48298, 1.41950, 1.33216, 1.66339, 1.54796, 1.44741, 1.3…
## $ `s(year).6` <dbl> 1.37901, 1.39659, 1.35724, 1.34916, 1.52628, 1.32565, 1.1…
## $ `s(year).7` <dbl> 1.51337, 1.63503, 1.41908, 1.55674, 1.60007, 1.29755, 1.5…
## $ `s(year).8` <dbl> 2.26475, 2.36173, 2.35163, 2.41688, 2.34515, 2.19913, 2.2…
## $ `s(year).9` <dbl> 2.85323, 2.99390, 2.91566, 2.78095, 2.78740, 2.95776, 2.9…
## $ `s(year).10` <dbl> 2.24990, 2.38087, 2.35128, 2.16558, 2.40147, 2.25287, 2.0…
## $ `s(year).11` <dbl> 2.44654, 2.56370, 2.42213, 2.47414, 2.57401, 2.40924, 2.5…
## $ `s(year).12` <dbl> 2.62789, 2.79551, 2.73994, 2.74291, 2.77003, 2.62836, 2.5…
## $ `s(year).13` <dbl> 1.63537, 1.61050, 1.57763, 1.66412, 1.86491, 1.65020, 1.5…
## $ `s(year).14` <dbl> 2.322990, 2.945500, 2.919210, 1.635940, 2.374750, 2.54293…
## $ `s(year).15` <dbl> 1.623840, 1.855830, 4.435550, 1.939410, 1.722940, 1.80244…
## $ `s(year).16` <dbl> 3.83239, 3.87271, 1.11025, 4.67180, 3.60005, 3.16664, 3.7…
## $ `s(year).17` <dbl> 1.620440, 2.117590, 2.468780, 2.949900, 3.449950, 3.47414…
The posterior distribution for the effect of ndvi is
stored in the ndvi column. A quick histogram confirms our
inference that log(counts) respond positively to increases
in ndvi:
hist(beta_post$ndvi,
xlim = c(-1 * max(abs(beta_post$ndvi)),
max(abs(beta_post$ndvi))),
col = 'darkred',
border = 'white',
xlab = expression(beta[NDVI]),
ylab = '',
yaxt = 'n',
main = '',
lwd = 2)
abline(v = 0, lwd = 2.5)
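Because beta_post$ndvi is just a vector of posterior draws, other summaries are one-liners; for example, the posterior probability that the effect is positive is simply the proportion of draws above zero. A sketch with simulated draws standing in for beta_post$ndvi:

```r
set.seed(7)
# stand-in for beta_post$ndvi (values chosen to mimic the summary above)
ndvi_draws <- rnorm(2000, mean = 0.39, sd = 0.03)

# posterior probability of a positive effect
mean(ndvi_draws > 0)

# a compact posterior summary
c(median = median(ndvi_draws),
  lower = unname(quantile(ndvi_draws, 0.025)),
  upper = unname(quantile(ndvi_draws, 0.975)))
```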
Because our model uses a nonlinear link function (the log link, in this example), it can still be difficult to fully understand what
relationship our model is estimating between a predictor and the
response. Fortunately, the
marginaleffects package makes
this relatively straightforward. Objects of class mvgam
contain a lightweight version of the underlying mgcv model
in which the posterior median coefficients and their empirical
covariances are stored. This means that we can make plots from this
object and they will be very close to the actual posterior that has been
estimated. Let’s have a look at the marginal effect of ndvi
on the outcome scale using the plot_predictions function
from marginaleffects (use ?plot_predictions
for guidance on how to modify these plots):
plot_predictions(model2$mgcv_model,
condition = "ndvi",
points = 0.5, rug = TRUE) +
theme_classic()
Now it is easier to get a sense of the nonlinear but positive
relationship estimated between ndvi and count.
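The nonlinearity here comes entirely from the link function: an effect that is linear on the log scale becomes multiplicative on the response scale. A quick base-R illustration, using made-up coefficients rather than the fitted values:

```r
# made-up intercept and slope on the log (link) scale
b0 <- 1.5
b_ndvi <- 0.4
ndvi_seq <- seq(0, 3, length.out = 100)

# linear on the link scale ...
eta <- b0 + b_ndvi * ndvi_seq

# ... but curved (and always positive) on the response scale
mu <- exp(eta)

# each unit increase in ndvi multiplies the expected count by
# exp(b_ndvi), rather than adding a constant amount
exp(b_ndvi)
```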
But has this predictor made an impact on our estimated hindcasts and
forecasts? Visualise these plots as we did previously
plot(model2, type = 'forecast', newdata = data_test)
## Out of sample DRPS:
## [1] 150.1655
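The DRPS reported under these forecast plots is the discrete rank probability score summed over the out-of-sample observations: for each observation, it compares the forecast CDF against a step function located at the observed count, so sharper and better-calibrated forecasts score lower. A sketch of the per-observation score from the standard definition (my own implementation for illustration; mvgam's internal code may differ in details):

```r
# discrete rank probability score for one observation, given
# posterior predictive draws (standard CDF-based definition)
drps_single <- function(draws, y, K = max(c(draws, y)) + 10) {
  ks <- 0:K
  # empirical forecast CDF evaluated at each count k
  Fhat <- vapply(ks, function(k) mean(draws <= k), numeric(1))
  # step-function CDF of the observed count
  Fobs <- as.numeric(y <= ks)
  sum((Fhat - Fobs)^2)
}

set.seed(5)
draws <- rpois(2000, lambda = 10)
drps_single(draws, y = 10) # observation near the forecast: small score
drps_single(draws, y = 30) # observation far from the forecast: large score
```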
These predictions certainly look different. Are the residual diagnostics improved?
plot(model2, type = 'residuals')
Exercises
1. Plot the relationship between ndvi and our outcome count
2. Add mintemp as a second predictor. Be sure to first check whether there is a strong pairwise correlation between mintemp and ndvi (hint, use ?cor.test for guidance)
3. Does this model estimate a strong effect of mintemp on log(counts)? Or does inclusion of mintemp lead to different inference on the effect of ndvi compared to when mintemp was not in the model?

The final point to cover in this tutorial is to incorporate nonlinear
functions of time in place of the temporal yearly random effects.
Nonlinear splines are commonly viewed as variations of random effects in
which the coefficients that control the shape of the spline are drawn
from a joint, penalized distribution. This strategy is very often used
in ecological time series analysis to capture smooth temporal variation
in the processes we seek to study. The lectures have gone into detail
about smoothing splines and basis functions, but it is worth quickly
re-iterating some of this detail here. When we construct smoothing
splines, the workhorse package mgcv will calculate a set of
basis functions that will collectively control the shape and complexity
of the resulting spline. It is often helpful to visualize these basis
functions to get a better sense of how splines work. We’ll create a set
of 6 basis functions to represent possible variation in the effect of
time on our outcome. This can easily be done using
functions in the gratia package:
basis_funs <- basis(s(time, bs = 'tp', k = 6),
data = model_data)
Once the functions are created, the next step is to plot them (using
ggplot2 utilities)
draw(basis_funs) +
theme_classic() +
labs(y = 'Basis function value', x = 'Time')
Upping the potential complexity of the smooth function, by increasing
k (to 10 in this example), leads to a larger diversity of
basis functions:
draw(basis(s(time, bs = 'tp', k = 10),
data = model_data)) +
theme_classic() +
labs(y = 'Basis function value', x = 'Time')
We can dig deeper into how these basis functions are constructed using some base R routines:
# Set up the smooth object that is needed to predict values of
# basis functions
temp_gam <- gam(count ~ s(time, bs = 'tp', k = 6),
data = model_data,
family = poisson(), fit = FALSE)
# We can understand how this penalty matrix is used to
# translate values of `time` into basis functions by
# generating a sequence of `time` values for sampling
# prior function realisations
cov_seq <- seq(min(model_data$time),
max(model_data$time),
length.out = 500)
summary(cov_seq)
# Generate the linear predictor matrix for this
# sequence of `time` values
Xp <- mgcv::PredictMat(temp_gam$smooth[[1]],
data.frame(time = cov_seq))
str(Xp)
# This linear predictor contains 5 columns, one for each
# function; the sequences of values in the columns represent
# the actual basis functions over the range of simulated
# `time` values
par(mar = c(4,4,2,1.5))
plot(1, type = 'n',
ylim = range(Xp),
xlim = range(cov_seq),
bty = 'l',
ylab = 'Basis function value',
xlab = 'Time')
cols <- viridis::viridis(n = NCOL(Xp), alpha = 0.8)
for(i in 1:NCOL(Xp)){
lines(x = cov_seq, y = Xp[,i], lwd = 3, col = 'white')
lines(x = cov_seq, y = Xp[,i], lwd = 2.5, col = cols[i])
}
box(bty = 'l', lwd = 2)
In addition to constructing the basis functions, mgcv
also creates a penalty matrix \(S\),
which contains known coefficients that work to
constrain the wiggliness of the resulting smooth function. When fitting
a GAM to data, we must estimate the smoothing parameters (\(\lambda\)) that will penalize these
matrices, resulting in constrained basis coefficients and smoother
functions that are less likely to overfit the data. This is the key to
fitting GAMs in a Bayesian framework, as we can jointly estimate the
\(\lambda\)’s using informative priors
to prevent overfitting and expand the complexity of models we can
tackle. To see this in practice, we can now fit a model that replaces
the yearly random effects with a smooth function of time.
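The quadratic penalty idea can be made concrete in a few lines: wiggliness is measured by \(b^{\top} S b\), and \(\lambda\) controls how strongly it is punished. A base-R sketch using a second-difference penalty matrix (a common construction that is similar in spirit to, though not identical to, the S matrices mgcv builds):

```r
# second-difference penalty matrix for 6 basis coefficients
D <- diff(diag(6), differences = 2)
S <- t(D) %*% D

# a wiggly and a smooth (linear) set of basis coefficients
b_wiggly <- c(1, -1, 1, -1, 1, -1)
b_smooth <- c(1, 1.2, 1.4, 1.6, 1.8, 2)

# lambda scales the penalty b' S b added to the model's loss
lambda <- 2
pen_wiggly <- lambda * as.numeric(t(b_wiggly) %*% S %*% b_wiggly)
pen_smooth <- lambda * as.numeric(t(b_smooth) %*% S %*% b_smooth)

# the wiggly coefficients incur a far larger penalty, so larger
# lambda pushes the estimated function toward smoother shapes
c(wiggly = pen_wiggly, smooth = pen_smooth)
```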
We will need a reasonably complex function (large k) to try
and accommodate the temporal variation in our observations. Following
some useful advice by Gavin Simpson, we will use a
b-spline basis for the temporal smooth. Because we no longer have
intercepts for each year, we also retain the primary intercept term in
this model (there is no -1 in the formula now):
model3 <- mvgam(count ~ s(time, bs = 'bs', k = 15) +
ndvi +
mintemp,
family = poisson(),
data = data_train,
newdata = data_test)
summary(model3)
## GAM formula:
## count ~ s(time, bs = "bs", k = 15) + ndvi + mintemp
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 160
##
## Status:
## Fitted using Stan
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## (Intercept) 2.19160375 2.3034050 2.41255550 1.00 870
## ndvi -0.11447937 -0.0494094 0.01804588 1.00 899
## mintemp 0.06152594 0.0664635 0.07110312 1.00 2905
## s(time).1 -1.72709050 -0.7546920 0.12619987 1.01 519
## s(time).2 -0.03049798 0.8725010 1.79732050 1.01 349
## s(time).3 -0.47699412 0.5651535 1.46456425 1.01 315
## s(time).4 1.74309800 2.6390200 3.59201975 1.01 325
## s(time).5 -1.32690050 -0.2616110 0.62881187 1.01 337
## s(time).6 -0.79174422 0.1250875 1.13450950 1.01 341
## s(time).7 -1.53458900 -0.4224735 0.53058190 1.01 339
## s(time).8 0.38086545 1.3041850 2.22969100 1.01 306
## s(time).9 0.66901118 1.6320400 2.52316200 1.01 303
## s(time).10 -0.49025770 0.5032930 1.39643350 1.01 297
## s(time).11 0.50026635 1.4685250 2.38813775 1.01 298
## s(time).12 0.34179292 1.2184650 2.07458075 1.01 306
## s(time).13 -1.67890750 -0.8422010 -0.01012554 1.01 411
## s(time).14 -5.83556225 -2.9188550 0.11683440 1.01 339
##
## GAM observation smoothing parameter (rho) estimates:
## 2.5% 50% 97.5% Rhat n.eff
## s(time) 0.08895923 1.017855 1.818376 1 1328
## s(time)2 0.79232885 3.055090 4.136762 1 869
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
The summary above now contains posterior estimates for the smoothing
parameters as well as the basis coefficients for the nonlinear effect of
time. We can visualize the conditional time
effect using the plot function with
type = 'smooths':
plot(model3, type = 'smooths')
By default this plot shows posterior empirical quantiles, but it can also be helpful to view some realizations of the underlying function (here, each line is a different potential curve drawn from the posterior of all possible curves):
plot(model3, type = 'smooths', realisations = TRUE,
n_realisations = 30)
A useful question when modelling using GAMs is to identify where the function is changing most rapidly. To address this, we can plot estimated 1st derivatives of the spline:
plot(model3, type = 'smooths', derivatives = TRUE)
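Derivative plots like this can be understood as finite differences of the posterior function draws: for each draw of the curve, successive differences approximate the local slope. A self-contained sketch using a simulated curve rather than model3's spline:

```r
# a simulated smooth function evaluated on a time grid, standing
# in for one posterior draw of the fitted spline
time_grid <- seq(0, 10, length.out = 200)
f_draw <- sin(time_grid)

# first-difference approximation to the derivative
f_deriv <- diff(f_draw) / diff(time_grid)

# compare against the true derivative cos(t), evaluated at the
# interval midpoints where the difference quotient is centred
mids <- head(time_grid, -1) + diff(time_grid) / 2
max(abs(f_deriv - cos(mids))) # small approximation error
```

Repeating this over all posterior draws of the curve yields a posterior distribution of derivatives at each timepoint, which is what the derivative plot summarises.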
Here, values >0 indicate the function was increasing at that time point, while values <0 indicate it was declining. The most rapid declines appear to have been
happening around timepoints 50 and again toward the end of the training
period, for example. To gain some idea of how the spline is being
penalized, we can again inspect the underlying Stan
code:
code(model3)
## // Stan model code generated by package mvgam
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_sp; // number of smoothing parameters
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## vector[num_basis] zero; // prior locations for basis coefficients
## real p_taus[3]; // prior precisions for parametric coefficients
## real p_coefs[3]; // prior locations for parametric coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## int<lower=0> ytimes[n, n_series]; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## matrix[14,28] S1; // mgcv smooth penalty matrix S1
## int<lower=0> n_nonmissing; // number of nonmissing observations
## int<lower=0> flat_ys[n_nonmissing]; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## int<lower=0> obs_ind[n_nonmissing]; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // smoothing parameters
## vector<lower=0>[n_sp] lambda;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## b[1:num_basis] = b_raw[1:num_basis];
## }
## model {
## // parametric effect priors (regularised for identifiability)
## for (i in 1:3) {
## b_raw[i] ~ normal(p_coefs[i], sqrt(1 / p_taus[i]));
## }
## // prior for s(time)...
## b_raw[4:17] ~ multi_normal_prec(zero[4:17],S1[1:14,1:14] * lambda[1] + S1[1:14,15:28] * lambda[2]);
## // priors for smoothing parameters
## lambda ~ normal(10, 25);
## {
## // likelihood functions
## flat_ys ~ poisson_log_glm(flat_xs,
## 0.0,b);
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## vector[n_sp] rho;
## array[n, n_series] int ypred;
## rho = log(lambda);
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){
## mus[1:n, s] = eta[ytimes[1:n, s]];
## ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
## }
## }
The line below // prior for s(time)... shows how the spline basis coefficients are drawn from a zero-centred multivariate normal distribution whose precision is built from the penalty matrix \(S\), scaled by two different smoothing parameters (the \(\lambda\)’s) to enforce smoothness and reduce overfitting. In the next tutorial, we will
expand on temporal smooths and understand their weaknesses for
generating out of sample forecasts.

Exercises
1. Change the parametric ndvi effect to a smooth function of ndvi using the s() argument (hint, see ?s for guidance on how to define smooth terms in GAM formulae using mgcv syntax). Inspect the resulting function estimates for both ndvi and time
2. Change k to a larger number, something like 50. Do you get any sense that there are problems with fitting this model? Do any of your conclusions change?

sessionInfo()
## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] viridis_0.6.3 viridisLite_0.4.2 zoo_1.8-12
## [4] marginaleffects_0.13.0 ggplot2_3.4.2 gratia_0.8.1
## [7] mvgam_1.0.5 mgcv_1.8-42 nlme_3.1-162
## [10] dplyr_1.1.1 downloadthis_0.3.2
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.1-0 ellipsis_0.3.2 brms_2.19.0
## [4] markdown_1.7 base64enc_0.1-3 fs_1.6.2
## [7] rstudioapi_0.14 farver_2.1.1 rstan_2.21.8
## [10] smooth_3.2.1 DT_0.28 fansi_1.0.4
## [13] mvtnorm_1.2-2 lubridate_1.9.2 bridgesampling_1.1-2
## [16] codetools_0.2-19 splines_4.2.3 cachem_1.0.8
## [19] knitr_1.43 texreg_1.38.6 shinythemes_1.2.0
## [22] greybox_1.0.8 bayesplot_1.10.0 jsonlite_1.8.7
## [25] nloptr_2.0.3 bsplus_0.1.4 shiny_1.7.4
## [28] httr_1.4.6 compiler_4.2.3 backports_1.4.1
## [31] assertthat_0.2.1 Matrix_1.5-3 fastmap_1.1.1
## [34] cli_3.6.1 later_1.3.1 htmltools_0.5.5
## [37] prettyunits_1.1.1 tools_4.2.3 igraph_1.5.0
## [40] coda_0.19-4 gtable_0.3.3 glue_1.6.2
## [43] reshape2_1.4.4 posterior_1.4.1 Rcpp_1.0.10
## [46] jquerylib_0.1.4 vctrs_0.6.1 crosstalk_1.2.0
## [49] insight_0.19.3 tensorA_0.36.2 xfun_0.39
## [52] stringr_1.5.0 ps_1.7.5 timechange_0.2.0
## [55] mime_0.12 miniUI_0.1.1.1 lifecycle_1.0.3
## [58] gtools_3.9.4 statmod_1.5.0 klippy_0.0.0.9500
## [61] MASS_7.3-58.2 scales_1.2.1 colourpicker_1.2.0
## [64] promises_1.2.0.1 Brobdingnag_1.2-9 parallel_4.2.3
## [67] inline_0.3.19 shinystan_2.6.0 yaml_2.3.7
## [70] mvnfast_0.2.8 gridExtra_2.3 loo_2.6.0
## [73] StanHeaders_2.26.26 sass_0.4.7 stringi_1.7.12
## [76] highr_0.10 dygraphs_1.1.1.6 checkmate_2.2.0
## [79] pkgbuild_1.4.2 zip_2.3.0 rlang_1.1.1
## [82] pkgconfig_2.0.3 matrixStats_1.0.0 distributional_0.3.2
## [85] pracma_2.4.2 evaluate_0.21 lattice_0.20-45
## [88] purrr_1.0.1 labeling_0.4.2 rstantools_2.3.1
## [91] patchwork_1.1.2 htmlwidgets_1.6.2 tidyselect_1.2.0
## [94] processx_3.8.1 plyr_1.8.8 magrittr_2.0.3
## [97] R6_2.5.1 generics_0.1.3 pillar_1.9.0
## [100] withr_2.5.0 xts_0.13.1 abind_1.4-5
## [103] tibble_3.2.1 crayon_1.5.2 utf8_1.2.3
## [106] rmarkdown_2.23 grid_4.2.3 data.table_1.14.8
## [109] callr_3.7.3 threejs_0.3.3 digest_0.6.31
## [112] xtable_1.8-4 tidyr_1.3.0 httpuv_1.6.11
## [115] RcppParallel_5.1.7 stats4_4.2.3 munsell_0.5.0
## [118] bslib_0.5.0 shinyjs_2.1.0